SUMMARY This paper first reviews the trends of VLSI design, focusing
on power dissipation and programmability. Then, we show the
advantage of Quaternary Decision Diagrams (QDDs) in representing and
evaluating logic functions. That is, we show how QDDs are used to
implement QDD machines, which yield high-speed implementations. We
compare QDD machines with binary decision diagram (BDD) machines, and
show a speed improvement of 1.28-2.02 times when QDDs are chosen. We
consider 1- and 2-address BDD machines, and 3- and 4-address QDD
machines, and we show a method to minimize the number of instructions.
key words: quaternary decision diagram, branching program machine

1.  Trends  of  VLSI  Design 

1.1 Explosion of Complexity

With  the  growth  of  multimedia  and  other  applications,  the 
demand  for  high-performance  processors  has  increased.  In 
the  past,  Moore’s  Law  solved  this  problem.  Moore’s  Law 
states  that  the  number  of  transistors  on  a  chip  doubles  every 
18  months. 

In the process of miniaturization, the scaling down of
transistor size and chip area has reduced power dissipation.
That is, by scaling down the transistor size in LSIs, chip
area, delay, and power dissipation can be reduced at the
same time. However, in the future, the number of transistors
on a chip is expected to fall short of that predicted by
Moore’s Law.

1.2  Power  Dissipation 

As transistor size decreases, supply voltage must also scale
down to keep the electric field in the integrated circuit constant
[32]. However, as the supply voltage decreases, subthreshold
leakage current increases. Nowadays, power dissipation
due to leakage current accounts for about 40% of
the total power dissipation in a microprocessor [5]. Therefore,
as supply voltage is reduced, power density is a limiting
factor. With an increase of the power density, the temperature
of the chip may become too high. To make matters
worse, leakage current increases exponentially with temperature
[3]. When a transistor produces more heat than the
heatsink can dissipate, thermal runaway occurs. Therefore,
cooling is very important. In the past, reduction of chip area
was the main design issue. However, nowadays, the reduction
of power dissipation is the primary design issue. In mobile
applications, battery size is limited, so the use of low
power devices is crucial.

1.3  Multi-Core  and  Parallel  Processing 

Power  dissipation  of  a  CMOS  gate  is  approximately 

$P = a \times V_{dd}^{2} \times f$,

where $a$ is a constant, $V_{dd}$ is the supply voltage, and $f$ is the
clock frequency.

Reduction of the supply voltage without changing transistor
dimensions requires a reduction in the clock frequency
$f$ [4]. Assume that the power supply voltage is reduced by
30%, and that the clock frequency is reduced by 50%. In
this case, we have

$a \times (0.7 V_{dd})^{2} \times 0.5 f \approx 0.25\, a V_{dd}^{2} f$.

Consider a dual-core version of this, as shown in Fig. 1.
In this case, halving the clock frequency is compensated
by doubling the number of processors, yielding nearly equal
throughput. That is, each core dissipates about a quarter of the
original power, so the two cores together dissipate about half
of the original power with no change in the system throughput.
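The arithmetic behind this estimate can be checked with a short Python sketch. The function name `relative_power` and the normalization to one core at nominal voltage and frequency are illustrative assumptions; only the relation P = a x V_dd^2 x f and the 30% and 50% reductions come from the text.

```python
def relative_power(voltage_scale: float, freq_scale: float, cores: int = 1) -> float:
    """Power relative to one core at nominal V_dd and f, from P = a * V_dd^2 * f."""
    return cores * voltage_scale ** 2 * freq_scale

# Supply voltage reduced by 30%, clock frequency reduced by 50%.
single = relative_power(0.7, 0.5)           # one slowed-down core
dual = relative_power(0.7, 0.5, cores=2)    # two such cores
throughput = 2 * 0.5                        # two cores, each at half the clock

print(f"single core: {single:.3f}, dual core: {dual:.3f}, throughput: {throughput:.1f}")
```

Under these assumptions, one slowed-down core uses about a quarter of the original power and the pair about half, while the idealized throughput stays at 1.0. Real workloads reach this only if they parallelize perfectly across the two cores.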

In personal computers, many threads are running at the
same time. Thus, many computers can benefit from multi-cores.
In this sense, chip area is increased to reduce power
dissipation. Increasing the number of cores increases the
chip cost, but the reduction of power dissipation is more important.

Fig. 1 Using a dual core processor to reduce the power by half.

By reducing power, cooling fans can often be eliminated
[2]. Also, reliability will be enhanced because of
lower temperatures. Excessively high temperature can burn
out the chip. Even if the temperature is low enough so that
this does not occur, high temperature can cause cumulative
damage.

In multi-core systems, unused cores can be turned off to
further reduce power dissipation. Unfortunately, developing
efficient software for multi-core is not so easy. Most existing
software is single-threaded. In a single-core processor, various
methods are used to increase the performance without
increasing the clock frequency, including pipelining, superscalar
and superpipelined architectures, and very long instruction
word processors (VLIWs). Unfortunately, even if the chip
area of a single-core processor is doubled to increase the
performance, the resulting performance is increased only by
1.4 times, as predicted by Pollack’s rule [4].

1.4 Programmable Device

With the miniaturization of chips, the cost of masks for
VLSI has increased drastically. Since the number of transistors
has increased, VLSI design is now very complicated.
As transistors become smaller, variability of the threshold
voltage of transistors increases. Therefore, achieving consistent
switching becomes difficult. As a result, design and
test cost has also increased [9]. Due to this, custom chips are
feasible only for mass-production products, such as games
and cellular phones. In addition, the life of today’s products
is short: every few months, new products are developed.
Thus, the number of newly developed VLSIs has
been reduced. Instead, microprocessors, application specific
standard products (ASSPs), and field programmable gate arrays
(FPGAs) are used to implement electronic appliances.
These can be customized by writing programs.

2.  Introduction  of  Branching  Program  Machines 

In  the  rest  of  this  paper,  we  focus  on  branching  program 
machines,  which  are  suitable  for  control  applications.  They 
are  programmable,  since  major  parts  consist  of  memories. 
Because  memory  is  involved,  reliability  can  be  improved  by 
using  traditional  techniques,  such  as  error  correcting  codes 
(ECC). 

Branching program machines for BDDs have been
used in control applications [6], [10]-[12]. Fast response is
especially important in control applications in which there
are usually hundreds of inputs. For such applications, a general
purpose microprocessor (MPU) cannot meet the speed
requirements. A branching program machine can be several
times faster than an MPU: An ordinary MPU requires two
or three machine instructions to read and test one input variable,
while the branching program machine requires just one
instruction [7].


Fig. 2 An example of BDD.

Fig. 3 MUX circuit.


Parallelization  can  be  implemented  by  multi-way 
branching  programs.  Thus,  performance  can  be  improved 
without  increasing  the  clock  frequency. 

2.1 Conversion from a Circuit to a Branching Program Machine

Consider the implementation of a given logic function.
This can be represented by a binary decision diagram
(BDD). Figure 2 shows the BDD of an example function,
$f(x_1, x_2, x_3, x_4) = x_1 x_2 \lor (x_3 \oplus x_4)$. In this diagram, dotted
lines (left lines) correspond to $x_i = 0$ and solid lines (right
lines) correspond to $x_i = 1$. By replacing each non-terminal
node of a BDD with a multiplexer (MUX), we have a circuit,
at the top of Fig. 3, that realizes the given logic function
whose BDD is shown in Fig. 2.
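To make the node-to-MUX correspondence concrete, the following Python sketch evaluates a BDD for the example function by walking one root-to-terminal path, where each step plays the role of one 2-to-1 multiplexer. The node table is a reconstruction from the formula under the assumed variable order x1, x2, x3, x4. The node names n1 to n5 are hypothetical, and the table may be drawn differently from Fig. 2.

```python
from itertools import product

# Hypothetical node table: name -> (variable position, child if 0, child if 1).
# The integers 0 and 1 are the terminal nodes.
BDD = {
    "n1": (0, "n3", "n2"),   # tests x1
    "n2": (1, "n3", 1),      # tests x2
    "n3": (2, "n4", "n5"),   # tests x3
    "n4": (3, 0, 1),         # reached when x3 = 0: result is x4
    "n5": (3, 1, 0),         # reached when x3 = 1: result is NOT x4
}

def eval_bdd(x, node="n1"):
    """Follow one path from the root; each step acts like a 2-to-1 MUX."""
    while node in BDD:
        var, low, high = BDD[node]
        node = high if x[var] else low
    return node

def f(x):
    """Reference definition: f = x1 x2 OR (x3 XOR x4)."""
    x1, x2, x3, x4 = x
    return (x1 & x2) | (x3 ^ x4)

mismatches = [x for x in product((0, 1), repeat=4) if eval_bdd(x) != f(x)]
print("mismatches:", mismatches)
```

An empty mismatch list indicates that the table realizes f on all 16 input combinations.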

However, such an implementation requires dedicated interconnections
and expensive masks. A branching program
machine is a sequential circuit that emulates the MUX circuit.
In this case, the interconnections are programmed in
a memory. Thus, by using a branching program machine, a
logic function is implemented by logic and memory. Since it
has no instruction fetch, it is faster and dissipates less power
than a general purpose microprocessor.

Unfortunately, a branching program machine is slower
than the original logic circuit, since it emulates the circuit
sequentially. A straightforward method to increase the
speed is to increase the clock frequency. However, this is
difficult in most cases. To increase processing speed without
increasing the clock frequency, we use a Multi-valued Decision
Diagram (MDD). For example, when two variables are
evaluated at the same time, the decision diagram has four-way
branches; this is called a Quaternary Decision Diagram
(QDD). In this way, performance is increased without increasing
the clock frequency. Such an idea is used in VLIW
processors [21], where branch instructions are multiway.

2.2  Optimization  of  Branching  Program  Machine 

A Quaternary Decision Diagram (QDD) machine is up to
two times faster than a BDD machine. However, instruction
words for the QDD machine require four address fields, i.e.,
instructions with many bits are necessary. This increases
the power dissipation, which is proportional to the number
of bits in the instruction words.

Optimization of code for a QDD machine can be
treated as an optimization of a 4-valued logic circuit. A
multi-core system of 128 QDD machines was implemented
on an FPGA [24]. This is up to 96 times faster than a microprocessor
(Core2Duo U7600, 1.2 GHz), even though the
QDD machines run at 100 MHz, while the microprocessor
runs at 1.2 GHz. Further, the power dissipation of the 128 QDD
machines is only a quarter of that of the microprocessor.

The rest of this paper is organized as follows: Section 3
introduces a method to represent multi-output logic
functions by multi-valued decision diagrams. Section 4
introduces branching program machines: It introduces both
a 4-address QDD machine and a 3-address QDD machine.
The 3-address QDD machine requires less memory than the
4-address QDD machine. Section 5 shows an optimization
problem of codes for 3-address QDD machines. Section 6
shows the experimental results. And finally, Section 7
concludes the paper.

3.  Representation  of  Multiple-Output  Functions 

3.1 Multi-Valued Decision Diagrams

An arbitrary $n$-variable logic function can be represented
by a binary decision diagram (BDD). Evaluation of a BDD
requires $n$ table look-ups. Figure 4 shows an example of
an MTBDD (multi-terminal binary decision diagram). In
this case, many outputs can be evaluated at the same time.
To further speed up the evaluation, a multiple-valued decision
diagram (MDD) is used. In the MDD(k), $k$ variables
are grouped to form a $2^k$-valued super variable. To evaluate
the MDD(k), we need at most $\lceil n/k \rceil$ table look-ups [20],
[25]. When the function is represented by an MDD(k), the
evaluation of a logic function can be $k$ times faster than the
corresponding BDD†. Thus, a larger $k$ yields a faster evaluation
of the MDD(k). Unfortunately, the size of memory
to represent a node for an MDD(k) is proportional to $2^k$, as
shown in Fig. 5. For many benchmark functions, the total
size of the memory for an MDD(k) achieves its minimum
when $k = 2$ [25]. Therefore, in logic evaluation, MDD(2)s
are more suitable than BDDs. Since nodes in an MDD(2)
have 4 branches, it is termed a Quaternary Decision Diagram
(QDD).

Fig. 4 Example of an MTBDD.

Fig. 5 Nodes for MDD(k).

Fig. 6 Conversion of BDD to MDD(2).
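The trade-off between evaluation steps and node size can be tabulated with a small Python sketch. The helper names `lookups` and `branches_per_node` are illustrative, and the choice n = 36 is only an example value (it matches the input count of C432 in Table 1). The sketch covers the per-node cost only and says nothing about how the number of nodes changes with k.

```python
def lookups(n: int, k: int) -> int:
    """Upper bound ceil(n / k) on table look-ups for an MDD(k) of n variables."""
    return -(-n // k)  # integer ceiling division

def branches_per_node(k: int) -> int:
    """A node of an MDD(k) has 2^k outgoing branches, so its size grows as 2^k."""
    return 2 ** k

n = 36
table = [(k, lookups(n, k), branches_per_node(k)) for k in (1, 2, 3, 4)]
for k, steps, branches in table:
    print(f"k={k}: at most {steps} look-ups, {branches} branches per node")
```

Going from k = 1 to k = 2 halves the bound on look-ups while doubling the branches per node. Larger k shortens evaluation further, but each node grows exponentially.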

3.2  Optimization  of  MDDs 

In an MDD(k), the evaluation of an $n$-variable logic function
can be done by at most $\lceil n/k \rceil$ table look-ups. So, the major
problem is the minimization of the number of nodes. In
general, it is not so easy to obtain an MDD(k) with the minimum
number of nodes. The following heuristic method is
used to obtain near minimal MDDs:

1. Minimize nodes of the BDD by a heuristic method [27].
2. Partition the input variables to generate an MDD(k) [28].

Figure 6 shows an example of a conversion from a BDD into
an MDD(2). In the above MDDs, we assume each group of
variables has the same size. Such MDDs are homogeneous
MDDs. When the groups have different sizes, the MDD
is a heterogeneous MDD. For simplicity, in this paper, we
consider only homogeneous MDDs.

†This is true only when the MDD(k) and the BDD are quasi-reduced.

4.  Branching  Program  Machine 

Special machines to evaluate MDDs have been developed
[13]-[15]. Unfortunately, they are unsuitable for practical
applications. Here, we consider a machine whose architecture
is well-suited for evaluating MDDs, but is easily
programmed.

4.1 2-Address BDD Machine

A branching program for BDDs uses only two kinds of instructions:

B_Branch(ADDR0, ADDR1), INDEX
Output DATA, and GOTO ADDR.

The first one is the binary branch instruction that is
similar to the computed GOTO statement of the FORTRAN
language: If the value of the input variable specified by INDEX
is equal to 0, then go to ADDR0, otherwise go to ADDR1.
The second one performs the output operation followed by an
unconditional GOTO operation.

Example 4.1: Consider the MTBDD shown in Fig. 4. The
following code evaluates the MTBDD:

N0: B_Branch(N2, N1), X1
N1: B_Branch(N2, T4), X2
N2: B_Branch(N3, N4), X3
N3: B_Branch(T0, T1), X4
N4: B_Branch(T2, T3), X4
T0: Output 0, and GOTO N0
T1: Output 9, and GOTO N0
T2: Output 10, and GOTO N0
T3: Output 11, and GOTO N0
T4: Output 15, and GOTO N0

In this example, DATA in Output DATA is the decimal
equivalent of the function output values expressed in binary
as $f_3, f_2, f_1, f_0$. (End of Example)
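The behavior of the 2-address BDD machine on this program can be sketched as a small Python interpreter. The tuple encoding of instructions and the function `run_once` are illustrative assumptions rather than the hardware format. The first target of N3 is read as T0, since the scanned listing is garbled at that position.

```python
# Program of Example 4.1, encoded as label -> instruction.
# ("branch", ADDR0, ADDR1, INDEX) or ("output", DATA, ADDR).
PROGRAM = {
    "N0": ("branch", "N2", "N1", 1),
    "N1": ("branch", "N2", "T4", 2),
    "N2": ("branch", "N3", "N4", 3),
    "N3": ("branch", "T0", "T1", 4),
    "N4": ("branch", "T2", "T3", 4),
    "T0": ("output", 0, "N0"),
    "T1": ("output", 9, "N0"),
    "T2": ("output", 10, "N0"),
    "T3": ("output", 11, "N0"),
    "T4": ("output", 15, "N0"),
}

def run_once(x, pc="N0"):
    """Run until the first Output; return (DATA, number of executed instructions).

    x is the tuple (x1, x2, x3, x4); INDEX is 1-based as in the listing.
    """
    steps = 0
    while True:
        instruction = PROGRAM[pc]
        steps += 1
        if instruction[0] == "output":
            return instruction[1], steps
        _, addr0, addr1, index = instruction
        pc = addr1 if x[index - 1] else addr0

print(run_once((1, 1, 0, 0)))   # path N0, N1, T4
print(run_once((0, 0, 1, 1)))   # path N0, N2, N4, T3
```

For example, the input (1, 1, 0, 0) reaches T4 after three instructions and yields 15, i.e., all four outputs are 1.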

Figure 7 shows the architecture of the 2-address BDD machine,
where only the circuit for the branching operation is
shown. The first field, COM, of the branching instruction
specifies the branch command. The second field, INDEX,
specifies the index $i$ of the input variable $x_i$. It determines
which variable to select. The input selector in Fig. 7 produces
the value of the variable $x_i$, which selects the next branch
address. When $x_i = 0$, ADDR0 is selected. Otherwise,
ADDR1 is selected. The selected address is then loaded
into the program counter (PC). In this way, the next address
is specified. To reduce the width of the instruction words,
1-address BDD machines shown in Fig. 8 have been developed
[6], [11], [18], [33]. In this case, when the value of the
variable specified by INDEX is 1, the machine works similarly
to the 2-address BDD machine. Otherwise, the content of the
program counter (PC) is incremented by one, to access the next
address. In this case, the size of the instruction word is reduced,
but unconditional GOTO instructions are necessary,
as shown later.

Fig. 7 2-address BDD machine.

Fig. 8 1-address BDD machine.

4.2 4-Address QDD Machine

By simultaneously evaluating two binary variables and by
increasing the number of branch addresses to four, we have
a branch instruction for a 4-address QDD machine. Since
it evaluates two binary variables at a time, it can reduce the
evaluation time to half that of the 2-address BDD machine.

Fig. 9 4-address QDD machine.

| Branch | INDEX | ADDR0 | ADDR1 | ADDR2 | ADDR3 |

Fig. 10 Branch instruction for 4-address QDD machine.

| Output | Address | Output Values |

Fig. 11 Output instruction for a QDD machine.

A branching program for 4-address QDD machines
consists of two kinds of instructions:

Q_Branch(ADDR0, ADDR1, ADDR2, ADDR3), INDEX
Output DATA, and GOTO ADDR

Figure 10 shows the format for the branch instruction. Figure 9
shows the architecture of the 4-address QDD machine,
where only the circuit for the branching operation is
shown. The first field of the branching instruction specifies
the branch command. The second field, INDEX, specifies
the index $i$ of the input variable $X_i$. It determines which
variables to select. In the case of a QDD, two consecutive
binary variables are selected at a time. The input selector
shown in Fig. 9 produces $X_i$. The upper multiplexer selects
the variable. When $X_i = (0, 0)$, ADDR0 is selected; when
$X_i = (0, 1)$, ADDR1 is selected; when $X_i = (1, 0)$, ADDR2 is
selected; and when $X_i = (1, 1)$, ADDR3 is selected. The selected
address is then loaded into the program counter (PC).
In this way, the next address is specified as a function of INDEX
$i$ and the input variable $X_i$. Note that this instruction
requires a rather long word, which would be expensive for
embedded applications.

Figure 11 shows the format for the output instruction.
The left field specifies the instruction type: Output. The
middle field contains the address to which this program
should jump. The right field is the output value, as shown at
the bottom of the QDD.
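The address selection of the 4-address branch instruction can be sketched in Python as follows. The function name `q_branch` is hypothetical, and the pairing of binary inputs, with X_i taken as the pair (x_{2i-1}, x_{2i}) and the first bit as the more significant one, is an assumption that is consistent with the selection rule stated above.

```python
def q_branch(addrs, index, x):
    """Return the next address for Q_Branch(ADDR0, ADDR1, ADDR2, ADDR3), INDEX.

    addrs: the four address fields (ADDR0, ADDR1, ADDR2, ADDR3).
    index: 1-based index i of the 4-valued super variable X_i.
    x:     tuple of binary inputs (x1, x2, ...); X_i is assumed to be the
           pair (x_{2i-1}, x_{2i}), with the first bit more significant.
    """
    high = x[2 * (index - 1)]
    low = x[2 * (index - 1) + 1]
    return addrs[2 * high + low]

addresses = ("A0", "A1", "A2", "A3")
inputs = (1, 0, 0, 1)                     # X_1 = (1, 0), X_2 = (0, 1)
print(q_branch(addresses, 1, inputs))     # selects ADDR2
print(q_branch(addresses, 2, inputs))     # selects ADDR1
```

One such instruction consumes two binary inputs, which is where the factor-of-two reduction in evaluation steps relative to the 2-address BDD machine comes from.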


| Branch0 | INDEX | ADDR1 | ADDR2 | ADDR3 |

Fig. 12 Branch instruction for a 3-address QDD machine.

Fig. 13 3-address QDD machine.

4.3  3-Address  QDD  Machine 

Since the 4-address QDD instruction requires a long word,
we developed a 3-address QDD machine. The branch instruction
for the 3-address QDD machine contains only
three address fields. For example, consider the instruction
shown in Fig. 12. This instruction is symbolically denoted
by

Q_Branch(+1, ADDR1, ADDR2, ADDR3), INDEX.

In this instruction, ADDR1, ADDR2, and ADDR3 are specified,
but ADDR0 is missing. ADDR0 is replaced by “+1”,
which corresponds to the next address of the current instruction.
This instruction performs the following operations:

• Let i be the value of the input variable specified by INDEX.
If (i = 0), then go to the next address of the current
instruction; else go to ADDRi.

Lemma 4.1: An arbitrary QDD can be evaluated by a program
consisting of the following instructions:

Q_Branch(+1, ADDR1, ADDR2, ADDR3), INDEX
GOTO ADDR
Output DATA, and GOTO ADDR

For example, the instruction for the 4-address QDD machine

Q_Branch(ADDR0, ADDR1, ADDR2, ADDR3), INDEX

can be simulated by the pair of instructions:

Q_Branch(+1, ADDR1, ADDR2, ADDR3), INDEX
GOTO ADDR0

Note that the last instruction is an unconditional GOTO
statement. As shown in the next section, the number of unconditional
GOTO statements can be minimized by an optimization
algorithm. Figure 13 shows the architecture of the
3-address QDD machine, where only the circuit for branching
operations is shown. Consider the instruction in Fig. 12.
When the value of the input variable specified by INDEX is
nonzero, the machine behaves like the 4-address QDD machine.
When that value is equal to 0, the program counter (PC) is
incremented by one, to access the next address.

| Branch0 | INDEX | ADDR1 | ADDR2 | ADDR3 |
| Branch1 | INDEX | ADDR0 | ADDR2 | ADDR3 |
| Branch2 | INDEX | ADDR0 | ADDR1 | ADDR3 |
| Branch3 | INDEX | ADDR0 | ADDR1 | ADDR2 |

Fig. 14 Four types of branch instructions for 3-address QDD machine.

In our hardware implementation, we use the four types
of branch instructions shown in Fig. 14. To distinguish the four
branch instructions, we use two additional bits in the instruction
field. However, as shown in the experimental results, by
using four branch instructions, we can reduce the number of
instructions and the total bit size. So, the cost of these extra
bits is fully compensated.
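The next-address rule for the four branch types can be sketched in Python. The text spells out only the Branch0 case; the behavior assumed here for Branch1 to Branch3, namely that the address field missing from the format in Fig. 14 is the one replaced by "+1", is a reading of that figure rather than a statement from the paper. The function name `next_pc` and the use of numeric addresses are illustrative.

```python
def next_pc(pc, kind, addrs, value):
    """Next program counter for a 3-address branch of type Branch<kind>.

    pc:    address of the current instruction.
    kind:  k in 0..3; ADDRk is assumed to be the implicit "+1" target.
    addrs: the three explicit address fields, in increasing order of k.
    value: value of the selected 4-valued input X_i, in 0..3.
    """
    if value == kind:
        return pc + 1                       # fall through to the next instruction
    explicit = [k for k in range(4) if k != kind]
    return addrs[explicit.index(value)]     # jump to the stored address

# Branch0 at address 10 with fields (ADDR1, ADDR2, ADDR3) = (20, 30, 40).
print([next_pc(10, 0, (20, 30, 40), v) for v in range(4)])
# Branch2 at address 10 with fields (ADDR0, ADDR1, ADDR3) = (20, 30, 40).
print([next_pc(10, 2, (20, 30, 40), v) for v in range(4)])
```

Under this reading, choosing the branch type amounts to choosing which of the four successors is placed immediately after the instruction, which gives the code generator more freedom to avoid unconditional GOTO statements.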

5.  Optimization  of  Codes  for  QDD  Machines 

In this section, we consider a method to reduce the number
of instructions for QDD machines. Interestingly, this is
solved by minimizing the number of unconditional GOTO
statements.

Definition 5.1: Given the QDD and an order of the input
variables (e.g., $x_1, x_2, \ldots,$ and $x_n$), the code size CSIZE is the
number of instructions needed to compute the decision diagram
on a given machine. Let 4aQDDM denote a 4-address
QDD machine, and let 3aQDDM denote a 3-address QDD
machine.

Lemma 5.2: Let $N_N$ be the number of non-terminal nodes,
and let $N_T$ be the number of terminal nodes in a QDD. We
have the following relation:

$CSIZE(4aQDDM) = N_N + N_T$. (1)

(Proof) In a 4-address QDD machine, a non-terminal node
is represented by a branch instruction, and a terminal node
is represented by an output instruction. (Q.E.D.)

Lemma 5.3: Let $N_N$ be the number of non-terminal nodes
and let $N_T$ be the number of terminal nodes in a QDD. Let
$N_U$ be the number of unconditional GOTO statements that
are not part of output statements. Then, we have the following
relations:

$CSIZE(3aQDDM) = N_U + N_N + N_T$ (2)

$0 \le N_U \le N_N$ (3)

(Proof) In a 3-address QDD machine, a non-terminal node
is represented by either a branch instruction or a pair consisting
of a branch instruction and an unconditional GOTO
statement. Also, a terminal node is represented by an output
instruction. Thus, the number of unconditional GOTO
statements is at most the number of non-terminal nodes.
(Q.E.D.)

In  the  case  of  a  4-address  QDD  machine,  there  is  no 
code  optimization  problem,  i.e.,  the  instructions  can  be  gen¬ 
erated  in  any  order.  However,  in  the  case  of  a  3-address 
QDD  machine,  the  length  of  the  program  depends  on  the 
order  of  instructions. 

Example 5.2: Consider the QDD shown in Fig. 15. It has
five non-terminal nodes, and four terminal nodes. When the
code is generated in breadth-first order, i.e., in the order of
X1, X2 and X3, we have the following:

/** Code with Unconditional GOTO **/

N0: Q_Branch(+1, N1, N1, N1), X1
    Q_Branch(+1, N3, N3, N3), X2
    GOTO N2
N1: Q_Branch(+1, T3, T3, T3), X2
    GOTO N3
N2: Q_Branch(+1, T1, T1, T1), X3
    GOTO T0
N3: Q_Branch(+1, T2, T2, T2), X3
    GOTO T1
T0: Output 0, and GOTO N0
T1: Output 1, and GOTO N0
T2: Output 2, and GOTO N0
T3: Output 3, and GOTO N0

Note that the above program has four unconditional GOTO
statements that are not part of output statements. However,
when the code is generated in depth-first order, it has no
unconditional GOTO statements that are not part of output
statements:

/** Code without Unconditional GOTO **/

N0: Q_Branch(+1, N1, N1, N1), X1
    Q_Branch(+1, N3, N3, N3), X2
    Q_Branch(+1, T1, T1, T1), X3
T0: Output 0, and GOTO N0
N1: Q_Branch(+1, T3, T3, T3), X2
N3: Q_Branch(+1, T2, T2, T2), X3
T1: Output 1, and GOTO N0
T2: Output 2, and GOTO N0
T3: Output 3, and GOTO N0

Note that the first four instructions correspond to the leftmost
path from the root node to the terminal node T0. The
next three instructions correspond to the path through node N1,
node N3, and terminal node T1. (End of Example)
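The two listings of Example 5.2 can be compared with a small Python interpreter for the 3-address QDD machine restricted to the Branch0 type. The tuple encoding, the helper names `br`, `out`, and `run`, and the representation of the inputs as a tuple (X1, X2, X3) of values in 0..3 are illustrative assumptions.

```python
from itertools import product

def br(a1, a2, a3, index):
    """Encode Q_Branch(+1, a1, a2, a3), X<index>."""
    return ("br", a1, a2, a3, index)

def out(data):
    """Encode 'Output data, and GOTO N0' (the GOTO target is not simulated)."""
    return ("out", data)

BREADTH_FIRST = [
    ("N0", br("N1", "N1", "N1", 1)),
    (None, br("N3", "N3", "N3", 2)),
    (None, ("goto", "N2")),
    ("N1", br("T3", "T3", "T3", 2)),
    (None, ("goto", "N3")),
    ("N2", br("T1", "T1", "T1", 3)),
    (None, ("goto", "T0")),
    ("N3", br("T2", "T2", "T2", 3)),
    (None, ("goto", "T1")),
    ("T0", out(0)), ("T1", out(1)), ("T2", out(2)), ("T3", out(3)),
]

DEPTH_FIRST = [
    ("N0", br("N1", "N1", "N1", 1)),
    (None, br("N3", "N3", "N3", 2)),
    (None, br("T1", "T1", "T1", 3)),
    ("T0", out(0)),
    ("N1", br("T3", "T3", "T3", 2)),
    ("N3", br("T2", "T2", "T2", 3)),
    ("T1", out(1)), ("T2", out(2)), ("T3", out(3)),
]

def run(program, X):
    """Execute from N0 to the first Output; return (DATA, executed instructions)."""
    labels = {label: addr for addr, (label, _) in enumerate(program) if label}
    pc, steps = labels["N0"], 0
    while True:
        ins = program[pc][1]
        steps += 1
        if ins[0] == "out":
            return ins[1], steps
        if ins[0] == "goto":
            pc = labels[ins[1]]
        else:
            value = X[ins[4] - 1]
            # value 0 falls through (+1); values 1..3 pick ADDR1..ADDR3.
            pc = pc + 1 if value == 0 else labels[ins[value]]

same = all(run(BREADTH_FIRST, X)[0] == run(DEPTH_FIRST, X)[0]
           for X in product(range(4), repeat=3))
print(len(BREADTH_FIRST), len(DEPTH_FIRST), same)
```

With these encodings, the breadth-first listing has 13 instructions and the depth-first listing has 9, which agrees with relation (2) for N_N = 5, N_T = 4 and N_U = 4 or 0, and both produce the same output for all 64 input combinations.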

The code optimization problem for a 3-address QDD machine
can be reduced to a graph covering problem as follows:

Definition 5.2: A path cover of a QDD is a set of paths
such that every node in the QDD belongs to exactly one
path. A minimal path cover is a path cover with the fewest
paths. A path in a QDD can consist of just one node.

Theorem 5.1: An optimal code for a 3-address QDD machine
corresponds to a minimal disjoint path cover of the
QDD.

(Proof) A path in a QDD corresponds to a sequence of
Q_Branch instructions followed by an output instruction. A
sequence of Q_Branch instructions without an output instruction
requires an unconditional GOTO statement. By
Lemma 5.3, minimization of the number of unconditional
GOTO statements minimizes the code size. (Q.E.D.)
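The relation between path covers and GOTO statements can be illustrated with a simple greedy procedure in Python. This is a sketch of one possible heuristic, not the algorithm used in the experiments, and it does not guarantee a minimal cover. The graph below is the QDD of Example 5.2 as inferred from its code listing; the name "Na" for the unlabeled second node and the visiting order are assumptions.

```python
# Children of each non-terminal node, indexed by the value 0..3 of its variable.
QDD = {
    "N0": ["Na", "N1", "N1", "N1"],
    "Na": ["N2", "N3", "N3", "N3"],
    "N1": ["N3", "T3", "T3", "T3"],
    "N2": ["T0", "T1", "T1", "T1"],
    "N3": ["T1", "T2", "T2", "T2"],
}
ORDER = ["N0", "Na", "N1", "N2", "N3", "T0", "T1", "T2", "T3"]

def greedy_path_cover(qdd, order):
    """Cover every node by disjoint paths, extending each path greedily.

    From each uncovered start node, the path continues to the first child
    that is not yet covered. That child becomes the fall-through (+1)
    successor in the generated code.
    """
    covered, paths = set(), []
    for start in order:
        if start in covered:
            continue
        path, node = [], start
        while node is not None:
            covered.add(node)
            path.append(node)
            node = next((c for c in qdd.get(node, ()) if c not in covered), None)
        paths.append(path)
    return paths

paths = greedy_path_cover(QDD, ORDER)
# A path that ends in a non-terminal node needs one unconditional GOTO.
gotos = sum(1 for path in paths if path[-1] in QDD)
code_size = len(ORDER) + gotos
print(paths, gotos, code_size)
```

For this QDD the procedure yields the paths N0-Na-N2-T0 and N1-N3-T1 plus the single-node paths T2 and T3, so no GOTO is needed and the code size is 9, matching the depth-first listing of Example 5.2.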

6.  Experiment  and  Observation 

6.1 Benchmark Results

To see the effectiveness of QDDs over BDDs, and the effectiveness
of the code optimization, we realized certain benchmark
functions by BDDs and QDDs. First, we compare
QDDs and BDDs with respect to the numbers of nodes.
Then, we convert these into code for BDD and QDD machines,
and compare the numbers of instructions.

Table 1 shows the experimental results. The columns are as follows:

• Func. Name denotes the name of the benchmark function.
• # Inp. denotes the number of input variables.
• # Out. denotes the number of outputs.
• BDD Nodes denotes the number of nodes of the MTBDD, including
both terminal and non-terminal nodes.
• Opt. Codes under BDD denotes the number of instructions of the
optimized code for the 1-address BDD machine (near optimal solution).
• Term. Nodes denotes the number of terminal nodes.
• Aver. Inst. under BDD denotes the average number of instructions
to evaluate an input vector by a 1-address BDD machine.
• QDD Nodes denotes the number of nodes of the MTQDD, including
both terminal and non-terminal nodes. This is the same as the number
of instructions for a 4-address QDD machine.
• X = 00 Codes under QDD denotes the number of instructions in the
code for the 3-address QDD machine, when only the first type of
instruction in Fig. 14 is used.
• Opt. Codes under QDD denotes the number of instructions of the
optimized code for the 3-address QDD machine, when all four types of
instructions in Fig. 14 are used to minimize the number of GOTO
statements.
• X = 00 GOTO denotes the number of GOTO statements, when only one
type of branching instruction is used.
• Opt. GOTO (= Opt. Codes - QDD Nodes) under QDD denotes the number
of GOTO statements, when four types of branching instructions are used.
• Aver. Inst. under QDD denotes the average number of instructions
to evaluate an input vector by a 3-address QDD machine.
• Ratio denotes the value: (Aver. Inst. in 1-address BDD machine) /
(Aver. Inst. in 3-address QDD machine).


Table 1 Number of nodes and code sizes for BDD machine and QDD machine.

                             BDD                               QDD
Func.      #    #    BDD   Opt.  Term.  Aver.    QDD   X=00   Opt.  X=00  Opt.  Aver.  Ratio
Name    Inp. Out.  Nodes  Codes  Nodes  Inst.  Nodes  Codes  Codes  GOTO  GOTO  Inst.
C432      36    7   1779   1779    128  19.10   1027   1408   1027   381     0  12.73   1.50
amd       14   24    206    206     84   5.63    164    171    164     7     0   3.47   1.62
apex2     39    3    335    363      8   6.66    231    332    265   101    34   4.99   1.33
apex4      9   19    749    750    319   8.24    600    639    601    39     1   4.61   1.79
chkn      29    7    220    241     28   7.01    157    215    172    58    15   5.16   1.36
duke2     22   29    636    637    255   6.36    546    594    547    48     1   4.09   1.55
gary      15   11    228    232     70   5.51    173    191    174    18     1   3.42   1.61
in0       15   11    195    200     52   5.02    145    170    148    25     3   2.92   1.72
in1       16   17    284    299     55   6.85    217    288    229    71    12   4.70   1.46
in2       19   10    291    296     73   3.98    219    262    225    43     6   2.60   1.53
in3       35   29    259    259     72   6.63    214    234    214    20     0   4.77   1.39
in4       32   20    607    611    178   4.69    491    569    495    78     4   3.44   1.36
in5       24   14    461    466    134   8.54    369    452    371    83     2   6.57   1.30
in6       33   23   4325   4338   1638   7.51   3546   3815   3555   269     9   5.88   1.28
in7       26   10    300    301    112   7.58    256    275    256    19     0   5.84   1.30
m181      15    9    222    222     84   6.80    196    217    196    21     0   4.71   1.44
misex2    25   18    113    113     35   4.97     91     96     91     5     0   3.60   1.38
misex3    14   14   2910   2975   1041   7.55   1773   2159   1773   386     0   4.05   1.86
misj      35   14   4656   4656   1408  14.12   3275   3828   3275   553     0   9.57   1.47
mlp6      12   12   5270   6062   1238  12.10   2582   2966   2694   384   112   5.98   2.02
risc       8   31     56     56     28   4.42     44     44     44     0     0   2.55   1.74
signet    39    8   7347   8652    128  18.23   5671   8374   6907  2703  1236  13.31   1.37
tial      14    8    697    790     49  12.05    388    552    466   164    78   6.37   1.89
vg2       25    8    131    135     24   7.65     89    110     91    21     2   5.62   1.36
x1dn      27    6    200    218     18   9.55    126    171    141    45    15   5.74   1.66
x6dn      39    5    214    231     28   4.14    159    215    177    56    18   2.74   1.52
x9dn      27    7    204    222     22   9.30    140    188    157    48    17   5.80   1.60
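The relations among the columns of Table 1 can be checked mechanically. The following Python sketch does so for four rows copied from the table; the tuple layout and the variable name `consistent` are illustrative choices. It checks that each code size equals the node count plus the corresponding number of GOTO statements, as in relation (2), and that Ratio equals the quotient of the two average instruction counts up to rounding to two decimals.

```python
# (name, BDD aver. inst., QDD nodes, X=00 codes, opt. codes,
#  X=00 GOTO, opt. GOTO, QDD aver. inst., ratio), copied from Table 1.
ROWS = [
    ("C432",  19.10, 1027, 1408, 1027, 381,   0, 12.73, 1.50),
    ("apex2",  6.66,  231,  332,  265, 101,  34,  4.99, 1.33),
    ("in6",    7.51, 3546, 3815, 3555, 269,   9,  5.88, 1.28),
    ("mlp6",  12.10, 2582, 2966, 2694, 384, 112,  5.98, 2.02),
]

consistent = True
for name, bdd_avg, nodes, x00_codes, opt_codes, x00_goto, opt_goto, qdd_avg, ratio in ROWS:
    ok = (
        x00_codes == nodes + x00_goto               # relation (2), Branch0 only
        and opt_codes == nodes + opt_goto           # relation (2), four branch types
        and abs(bdd_avg / qdd_avg - ratio) <= 0.005 # Ratio rounded to two decimals
    )
    consistent = consistent and ok
    print(f"{name}: {'consistent' if ok else 'inconsistent'}")
```

The rows for in6 and mlp6 are the ones with the smallest and largest Ratio, which correspond to the speed improvement range of 1.28 to 2.02 quoted in the summary.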
6.2 Detail of the Experiment

Optimization of Decision Diagrams: First, the variable ordering that minimizes the size of the MTBDD is obtained. Then, the input variables are partitioned into groups of two variables in the natural order to obtain the MTQDDs.
Optimization of Code: Theorem 5.1 shows how to minimize the number of instructions by minimizing the number of GOTO statements. The algorithm given in [16] is applicable only to programs whose nodes all have in-degree and out-degree equal to two. We therefore developed our own algorithm to obtain near-optimal solutions for our more general case.
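
The grouping step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the variable ordering has already been found, the function names are hypothetical, and the handling of a leftover single variable (when the number of inputs is odd) is an assumption, since the text does not state it.

```python
from typing import Dict, List, Sequence, Tuple


def pair_variables(order: Sequence[str]) -> List[Tuple[str, ...]]:
    """Partition an ordered variable list into consecutive groups of two.

    Assumption: if the number of variables is odd, the last group
    holds a single variable.
    """
    return [tuple(order[i:i + 2]) for i in range(0, len(order), 2)]


def quaternary_digit(bits: Sequence[int]) -> int:
    """Encode the binary values of one group as a digit (0..3 for a pair)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def to_quaternary(assignment: Dict[str, int],
                  groups: Sequence[Tuple[str, ...]]) -> List[int]:
    """Convert a binary input assignment into one digit per variable group."""
    return [quaternary_digit([assignment[v] for v in g]) for g in groups]


if __name__ == "__main__":
    groups = pair_variables(["x1", "x2", "x3", "x4"])
    print(groups)
    print(to_quaternary({"x1": 1, "x2": 0, "x3": 1, "x4": 1}, groups))
```

Each digit selects one of the four outgoing edges of a QDD node, so an evaluation that tests two binary variables in two BDD steps needs only one QDD step.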

6.3  Observations 

From  the  table,  we  can  observe  the  following: 

•  The  number  of  nodes  in  QDDs  is  smaller  than  that  of 
BDDs. 

• The number of instructions for the 3-address QDD machine can be considerably reduced by an optimization algorithm.

• For C432, in3, misex2, misj, and risc, the number of GOTO statements in the optimized QDD codes is zero. This means that optimal code is generated for these functions. Also, for these functions, optimal code for BDD machines is generated.

• signet requires many GOTO statements in both BDD and QDD machines. The number of GOTO statements for a BDD machine is given by (Opt. Codes) - (BDD Nodes) = 8671 - 7347 = 1324.

• Opt. Codes, the number of instructions for a 3-address QDD machine, is often larger than QDD Nodes, the number of instructions for a 4-address QDD machine. The column headed by Opt. GOTO (= Opt. Codes - QDD Nodes) shows the extra GOTOs. Except for a few functions, the number of extra GOTOs is rather small.

• Consider the value: (Sum of X = 00 Codes) - (Sum of Opt. Codes) = 28535 - 24528 = 4007. This is the total number of instructions saved by using four types of branch instructions instead of only one type. However, to specify four types of instructions, we need two additional bits in the instruction field. Let w be the number of bits in a word in the 3-address QDD machine, where only one type of branch instruction is used. Then, the merit of using four types of instructions is accurately expressed as: (Sum of X = 00 Codes) × w - (Sum of Opt. Codes) × (w + 2) = 28535w - 24528(w + 2) = 4007w - 49056. This value is positive whenever w ≥ 13. Note that, in most cases, w > 20, so we can conclude that the use of four types of Q_Branch instructions reduces the total number of bits.

• The last column of the table shows that the 3-address QDD machine is 1.28-2.02 times faster than the 1-address BDD machine. Note that, for mlp6, the ratio is greater than 2. This is due to GOTO statements. If we compare the average numbers of instructions in a 2-address BDD machine and a 4-address QDD machine, the ratio is at most 2.
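
The arithmetic in the observations above can be checked with a short script. This is only a sketch: the function names are hypothetical, and the figures are the ones quoted in the text (28535, 24528, 8671, and 7347), not recomputed from the table.

```python
def extra_gotos(codes: int, nodes: int) -> int:
    """GOTO statements = number of instructions minus number of DD nodes."""
    return codes - nodes


def bits_saved(sum_x00: int, sum_opt: int, w: int, extra_bits: int = 2) -> int:
    """Total bits saved by using four branch types instead of one.

    sum_x00: total instructions with a single branch type (w bits each).
    sum_opt: total instructions with four branch types (w + extra_bits each).
    """
    return sum_x00 * w - sum_opt * (w + extra_bits)


def break_even_width(sum_x00: int, sum_opt: int, extra_bits: int = 2) -> int:
    """Smallest integer word width w for which bits_saved is positive.

    Requires sum_x00 > sum_opt. The saving is positive exactly when
    w > sum_opt * extra_bits / (sum_x00 - sum_opt).
    """
    return (sum_opt * extra_bits) // (sum_x00 - sum_opt) + 1


if __name__ == "__main__":
    print(extra_gotos(8671, 7347))          # GOTOs quoted for signet (BDD)
    print(bits_saved(28535, 24528, w=20))   # saving at a 20-bit word
    print(break_even_width(28535, 24528))   # smallest w with a net saving
```

With the quoted sums the saving is 4007w - 49056 bits, so the two extra opcode bits pay for themselves once the word is at least 13 bits wide.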

6.4  Hardware  Implementation 

To show the usefulness of multi-core QDD machines, we have developed a parallel branching program machine (PBM128) consisting of 128 QDD machines and a programmable interconnection on Altera's Stratix II FPGA. We realized many benchmark functions on the PBM128, and compared its memory size and computation time with those of Intel's Core2Duo microprocessor. PBM128 requires approximately one quarter of the memory required by the Core2Duo, and is 21.4-96.1 times faster than the Core2Duo. Details are shown in [24].

7.  Conclusions 

In this paper, we first reviewed the trends of VLSI design, focusing on power dissipation and programmability. Then, we considered a branching program machine to evaluate multiple-output logic functions. To increase the speed of evaluation, we used QDDs instead of BDDs. To reduce the memory size, we used 3-address QDD machines instead of 4-address QDD machines. We proposed the use of four types of branch instructions. Also, we considered a method to optimize codes for 3-address QDDs. This is different from existing methods to optimize the decision diagrams. We showed that the minimization of the number of instructions corresponds to minimizing the number of unconditional GOTO statements. For various benchmark functions, we optimized the codes, and showed the effectiveness of the approach.